Spatial Data Workshop

2023 EPA R User Group Workshop

Marc Weber

EPA (USA)

Michael Dumelle

EPA (USA)

10/18/23

Welcome!

  1. Please visit this link to view the workshop’s accompanying workbook

  2. Install and load R packages by visiting this link

  3. (Optional) Download the workshop slides (instructions in the workbook’s “Welcome” page)

  4. Follow along and have fun!

Who Are We?

Marc Weber is a geographer in the Pacific Ecological Systems Division (PESD) at the United States Environmental Protection Agency (USEPA). His work supports various aspects of the USEPA’s National Aquatic Resource Surveys (NARS), which characterize the condition of waters across the United States; he helped develop, and maintains, the StreamCat and LakeCat datasets. His work focuses on spatial analysis in R and Python, Geographic Information Science (GIS), aquatic ecology, remote sensing, open source science, and environmental modeling.

Who Are We?

Michael Dumelle is a statistician for the United States Environmental Protection Agency (USEPA). He works primarily on facilitating the survey design and analysis of USEPA’s National Aquatic Resource Surveys (NARS), which characterize the condition of waters across the United States. His primary research interests are in spatial statistics, survey design, environmental and ecological applications, and software development.

Disclaimer

The views expressed in this workshop are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency or the U.S. National Oceanic and Atmospheric Administration. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government, the U.S. Environmental Protection Agency, or the U.S. National Oceanic and Atmospheric Administration. The U.S. Environmental Protection Agency and the U.S. National Oceanic and Atmospheric Administration do not endorse any commercial products, services, or enterprises.

What Will We Cover?

  • Foundations of spatial data
  • Visualizing spatial data
  • Geoprocessing spatial data
  • Advanced applications
  • Focus on the R computing language

Today’s Agenda

  • 1:00pm - 1:35pm EDT: Introduction and Spatial Data Structures in R
  • 1:35pm - 2:05pm EDT: Vector Data Model
  • 2:05pm - 2:10pm EDT: Break
  • 2:10pm - 2:30pm EDT: Raster Data Model
  • 2:30pm - 3:15pm EDT: Coordinate Reference Systems and I / O
  • 3:15pm - 3:25pm EDT: Break

Today’s Agenda

  • 3:25pm - 3:45pm EDT: Spatial Data Visualization
  • 3:45pm - 4:15pm EDT: Geoprocessing
  • 4:15pm - 4:20pm EDT: Break
  • 4:20pm - 5:00pm EDT: Advanced Applications

Foundations

Goals

  1. Understand fundamental spatial data structures and libraries in R.
  2. Become familiar with coordinate reference systems.
  3. Geographic I/O (input/output).

Data Structures

  • Review data structures section in workbook
  • Cover vectors, matrices, arrays, lists, data frames, etc.
  • Cover data frame manipulation using tidyverse functions and the pipe operator (%>% or |>)
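A minimal sketch of the data frame manipulation covered in the workbook, using dplyr verbs and the base pipe (the gauge data frame is invented for illustration):

```r
library(dplyr)

# a small made-up data frame of stream gauge flows
gauges <- data.frame(
  site = c("A", "B", "C", "D"),
  flow = c(10.2, 3.5, 7.1, 12.8)
)

# |> passes the left-hand result into the next function
gauges |>
  filter(flow > 5) |>
  summarize(mean_flow = mean(flow))
```

The base pipe |> and the magrittr pipe %>% behave the same way in this example.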

Why R for Spatial Data Analysis?

  • Lightweight, open-source, and cross-platform
  • Provides tools for automation and reproducibility
  • Seamlessly integrates spatial data analysis and statistical analysis in one environment
  • Handles vector and raster data

R Drawbacks for GIS Work

  • R is less interactive than desktop applications like ArcGIS or QGIS
  • Handling coordinate reference systems is more challenging
  • In-memory analysis can be prohibitive for large data
  • Steep learning curve

A Motivating Example

  • Simple, expressive code
library(tmap)
library(tmaptools)
library(dplyr)

# find my town!
my_town <- tmaptools::geocode_OSM("Corvallis OR USA", as.sf = TRUE)
glimpse(my_town)
Rows: 1
Columns: 9
$ query   <chr> "Corvallis OR USA"
$ lat     <dbl> 44.56457
$ lon     <dbl> -123.262
$ lat_min <dbl> 44.51993
$ lat_max <dbl> 44.60724
$ lon_min <dbl> -123.3362
$ lon_max <dbl> -123.231
$ bbox    <POLYGON [°]> POLYGON ((-123.3362 44.5199...
$ point   <POINT [°]> POINT (-123.262 44.56457)

Your Turn

  1. What does :: do?
  2. What does geocode_OSM() do?
  3. Explain how code can be chained together using the |> operator
  4. What is glimpse()? How is it useful compared to head()?

Our Turn

  1. :: specifies from which package a function is called (e.g., tmaptools::geocode_OSM() calls geocode_OSM() from the tmaptools package)
  2. geocode_OSM() looks up a named feature in OpenStreetMap and returns its coordinates
  3. |> pipes the left-hand result into the first argument of the next function: “do this |> then that”
  4. glimpse() transposes the printout so that every column (and its type) is visible; head() prints the first rows in wide format, which can truncate columns
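To make answers 1 and 3 concrete, a small base-R sketch (the vector is invented):

```r
# :: calls a function from a named package without attaching it;
# stats::median() and median() refer to the same function here
x <- c(3, 1, 4, 1, 5)
stats::median(x)  # [1] 3

# |> reads as "do this, then that": each result becomes the
# first argument of the next call
x |>
  sort() |>
  head(3)  # [1] 1 1 3
```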

A Choropleth Map

  • A choropleth map shades geographic regions (e.g., counties) according to the value of a statistical variable
  • The tigris package can retrieve census, county, and state boundaries as vector sf data
library(tigris)
library(sf)
counties <- counties("Oregon", cb = TRUE)
counties$area <- as.numeric(st_area(counties))
glimpse(counties)

tm_shape(counties) +
  tm_polygons("area", style = "quantile", title = "Area of Oregon Counties")

A Choropleth Map

Figure 1: Distribution of Oregon counties.

Your Turn

Create and explore an interactive map of my_town by running

library(mapview)
mapviewOptions(fgb = FALSE)
mapview(my_town, col = "red", col.regions = "red") +
  mapview(counties, alpha.regions = .1)

Our Turn

Figure 2: Corvallis among distribution of Oregon counties.

Spatial Data Structures

  • Spatial data structures are primarily organized through:
  1. GDAL (gdal.org): For raster and vector data abstraction and processing
  2. PROJ (proj.org): For coordinate transformations and projections
  3. GEOS (libgeos.org): For spatial operations (e.g., calculating buffers or centroids) on data with a projected coordinate reference system (CRS)
  • We explore these core libraries throughout the workshop

Types of Spatial Data

  • Vector data consist of points, lines, and polygons that represent discrete spatial entities (e.g., a river, watershed, or stream gauge)
  • Raster data divide space into small rectangles (pixels) that represent spatially continuous phenomena like elevation or precipitation

Types of Spatial Data

Figure 3: Vector and raster data.

Vector Data Model

  • Vector data are described using simple features
  • Simple features is a standard that specifies how two-dimensional geometries are represented, how they are stored in and retrieved from databases, and which geometric operations can be performed on them
  • Provides a framework for extending POINT geometries to LINESTRING and POLYGON geometries

Types of Spatial Data

Figure 4: The simple features data model.

The sf Package

  • We use the sf package to work with spatial vector data in R
  • Learn more about sf at their website r-spatial.github.io/sf and/or by running
vignette(package = "sf") # see which vignettes are available
vignette("sf1")          # open the introduction vignette

sf Primitives

  • There are three basic geometric primitives with corresponding sf functions to create them:
  1. POINT: created with st_point()
  2. LINESTRING: created with st_linestring()
  3. POLYGON: created with st_polygon()
  • More complex simple feature geometries are built up using st_multipoint(), st_multilinestring(), st_multipolygon(), and st_geometrycollection()
  • Building blocks for more complicated/large data structures

POINTs

  • Points in sf are composed of coordinate pairs and a coordinate reference system
  • Points have no length or area
# create sf objects as points
library(sf)
library(ggplot2)

pt1 <- st_point(c(1, 3))
pt2 <- st_point(c(2, 1))
pt3 <- st_point(c(3, 2))
sf_point <- st_sfc(pt1, pt2, pt3)
ggplot(sf_point) + 
  geom_sf(color = "red") +
  labs(x = "X", y = "Y")

POINTs

Figure 5: sf POINT

LINESTRINGs

  • An ordered sequence of two or more POINTs
  • Points in a line are called vertices (or nodes)
  • Linestrings have length but no area
# create sf object as linestring
line1 <- sf::st_linestring(rbind(pt1, pt2))
line2 <- sf::st_linestring(rbind(pt2, pt3))
sf_linestring <- st_sfc(line1, line2)
ggplot(sf_linestring) + 
  geom_sf() +
  labs(x = "X", y = "Y")

LINESTRINGs

Figure 6: sf LINESTRING

POLYGONs

  • A closed ring of four or more points in which the first and last points are identical
  • Has both length (perimeter) and area
pt4 <- pt1
poly1 <- st_polygon(list(rbind(pt1, pt2, pt3, pt4)))
sf_polygon <- st_sfc(poly1)

# dimension
sf::st_dimension(sf_polygon)
[1] 2
ggplot(sf_polygon) + 
  geom_sf(fill = "grey")

POLYGONs

Figure 7: sf POLYGON

Putting Them All Together

ggplot() +
  geom_sf(data = sf_point, color = "red") +
  geom_sf(data = sf_linestring) +
  geom_sf(data = sf_polygon, fill = "grey")

Putting Them All Together

Figure 8: sf POINT, LINESTRING, and POLYGON overlain

Spatial Data Frames

  • sf objects combine data.frame attributes with POINT, LINESTRING, and POLYGON building blocks (the geometry)
  • sf objects also store the dimensionality, bounding box, coordinate reference system, and attribute headers
  • Read from files (e.g., shapefiles) using st_read() or read_sf()
library(spData)
data("us_states")
print(us_states)

Spatial Data Frames

Simple feature collection with 49 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
Geodetic CRS:  NAD83
First 10 features:
   GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
5     12     Florida    South 151052.01 [km^2]     18511620     19645772
6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
                         geometry
1  MULTIPOLYGON (((-88.20006 3...
2  MULTIPOLYGON (((-114.7196 3...
3  MULTIPOLYGON (((-109.0501 4...
4  MULTIPOLYGON (((-73.48731 4...
5  MULTIPOLYGON (((-81.81169 2...
6  MULTIPOLYGON (((-85.60516 3...
7  MULTIPOLYGON (((-116.916 45...
8  MULTIPOLYGON (((-87.52404 4...
9  MULTIPOLYGON (((-102.0517 4...
10 MULTIPOLYGON (((-92.01783 2...

Spatial Data Frames

Figure 9: The simple features data structure.

Your Turn

So far we have explored plotting using ggplot(), but you can also use sf’s base plotting via plot(). Read more about plotting in sf by running ?plot.sf and then make an appropriate plot of the us_states data.


Our Turn

Running plot() on an sf object plots every attribute variable by default, so it is common to subset to a single column

plot(us_states["AREA"])

Our Turn

Figure 10: Area by state in the US.

Raster Data Model

  • Raster data can be continuous or categorical
  • Raster data can be image based and can have a temporal component
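As an illustration, a categorical raster can be sketched in terra by attaching a levels table to integer cell values (the land-cover classes here are invented):

```r
library(terra)

# a small raster of integer class codes
lc <- rast(ncol = 4, nrow = 4, vals = rep(1:3, length.out = 16))

# a levels table maps codes to category labels
levels(lc) <- data.frame(id = 1:3, cover = c("forest", "water", "urban"))

is.factor(lc)  # TRUE: the raster is now categorical
plot(lc)       # plots with a class legend rather than a continuous ramp
```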

Raster Data Model

Figure 11: The raster data model.

The terra Package

  • We use the terra package to work with spatial raster data in R
  • terra is the modern replacement for the raster package
  • Learn more about terra at its website https://rspatial.org/
  • Use the stars package for spatio-temporal rasters

Creating a Raster Object

  • Create an empty SpatRaster object; define rows, columns, bounding box
library(terra)
myrast <- rast(ncol = 10, nrow = 10, xmax = -116, xmin = -126, ymin = 42, ymax = 46)
str(myrast)
S4 class 'SpatRaster' [package "terra"]
print(myrast)
class       : SpatRaster 
dimensions  : 10, 10, 1  (nrow, ncol, nlyr)
resolution  : 1, 0.4  (x, y)
extent      : -126, -116, 42, 46  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 

Your Turn

We just printed myrast and saw several components: class, dimensions, resolution, extent, and coord. ref. Explicitly return and inspect these pieces using class(), ncell(), res(), ext(), and crs().


Our Turn

class(myrast)
[1] "SpatRaster"
attr(,"package")
[1] "terra"
ncell(myrast)
[1] 100
res(myrast)
[1] 1.0 0.4
ext(myrast)
SpatExtent : -126, -116, 42, 46 (xmin, xmax, ymin, ymax)
crs(myrast)
[1] "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]],\n        ID[\"EPSG\",6326]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433],\n        ID[\"EPSG\",8901]],\n    CS[ellipsoidal,2],\n        AXIS[\"longitude\",east,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433,\n                ID[\"EPSG\",9122]]],\n        AXIS[\"latitude\",north,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433,\n                ID[\"EPSG\",9122]]]]"

Manipulating Raster Objects

  • Let’s add values to the raster object
values(myrast) <- 1:ncell(myrast)
plot(myrast)

Manipulating Raster Objects

Figure 12: Raster plot.

Reading a Raster Object

  • Let’s read in a raster object from the spDataLarge package that contains elevation data from Zion National Park
raster_filepath <- system.file("raster/srtm.tif", package = "spDataLarge")
my_rast <- rast(raster_filepath)
plot(my_rast)

Reading a Raster Object

Figure 13: Raster plot of elevation from Zion National Park.

Your Turn

The SpatRaster object in terra can hold multiple layers. Read in the four bands of Landsat 8 data from Zion National Park by running

landsat <- system.file("raster/landsat.tif", package = "spDataLarge")
landsat <- rast(landsat)

Then plot this raster using plot().


Our Turn

plot(landsat)

Figure 14: Raster plot of Landsat 8 data from Zion National Park.

Spatio-Temporal Raster Objects

  • The stars package has tools for spatio-temporal raster data
  • stars integrates with sf (recall that terra does not)
  • Learn more at their website https://r-spatial.github.io/stars/

Coordinate Reference Systems

A coordinate reference system (CRS) is made up of several components

  1. Coordinate system: The x,y grid that defines where your data lies in space
  2. Horizontal and vertical units: The units that describe the coordinate system
  3. Datum: The modeled version of the earth’s shape
  4. Projection details (if applicable): The mathematical equations used to flatten objects from the earth’s round surface onto a flat surface (paper or screen)

Properly specifying coordinate reference systems is a crucial step in handling or analyzing spatial data
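In sf, these components can be inspected with st_crs(); a short sketch using the WGS 84 geographic CRS, relying on the $-accessors (Name, IsGeographic, epsg, wkt) provided by sf’s crs class:

```r
library(sf)

crs <- st_crs(4326)   # look up a CRS by its EPSG code

crs$Name              # "WGS 84"
crs$IsGeographic      # TRUE: angular (lon/lat) coordinates, no projection
crs$epsg              # 4326
cat(crs$wkt)          # full WKT: datum, ellipsoid, axes, and units
```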

The Ellipsoid and Geoid

  • The earth can be thought of as an ellipsoid defined by a semi-major (equatorial) axis and a semi-minor (polar) axis
  • More precisely, the earth is a geoid: an irregular surface that is not perfectly smooth
  • A datum is built on top of the ellipsoid and incorporates local features
  • The most appropriate datum depends on the location of interest (in the US we typically use NAD83)

The Ellipsoid and Geoid

Figure 15: The ellipsoid and geoid.

Your Turn

There are several websites that are helpful for learning more about specific coordinate reference systems

  1. Spatial Reference: https://spatialreference.org/
  2. Projection Wizard: https://projectionwizard.org/
  3. EPSG: https://epsg.io/

Spend a few minutes learning about one or more of these references by visiting their websites.


Specifying a CRS

There are many ways to specify a CRS:

  1. Proj4 (legacy)
    • +proj=longlat +ellps=WGS84 ...
  2. OGC WKT / ESRI WKT
    • GEOGCS["WGS 84", DATUM["WGS_1984", SPHEROID["WGS 84", ...]]]
  3. authority:code identifier (modern/preferred)
    • EPSG:4326
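All three forms can be parsed by sf’s st_crs(), which resolves them to the same CRS; a sketch (prefer the authority:code form in new code):

```r
library(sf)

# modern authority:code identifier (preferred)
crs_epsg <- st_crs("EPSG:4326")

# legacy proj4 string: still accepted, but lossy
crs_proj4 <- st_crs("+proj=longlat +ellps=WGS84 +datum=WGS84")

# a bare integer is treated as an EPSG code
crs_int <- st_crs(4326)

crs_epsg == crs_int   # TRUE
crs_epsg$Name         # "WGS 84"
```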

Projected Coordinate Systems

  • In a projected CRS, coordinates have been projected from the earth’s surface to two-dimensional space
  • Projected coordinates have an origin, x-axis, y-axis, and unit of measure
  • Conformal projections preserve shape
  • Equal-area projections preserve area
  • Equidistant projections preserve distance
  • Azimuthal projections preserve direction

Projected Coordinate Systems

  • Here is an example using vector data; see the workbook for an example using raster data
library(Rspatialworkshop)
data("pnw")
# transform one to the other
utmz11 <- 2153
pnw <- st_transform(pnw, crs = utmz11)
ggplot() + 
  geom_sf(data = pnw, color = "black", fill = NA) +
  labs(title = "State Boundaries in the Pacific Northwest") +
  theme_bw() 

Projected Coordinate Systems

Figure 16: PNW State Boundaries.

Your Turn

Re-project the pnw data to a different projected CRS. Then plot using base R or ggplot2.


Our Turn

Project to NAD83 / Conus Albers (EPSG:5070)

pnw2 <- st_transform(pnw, crs = 5070)
plot(st_geometry(pnw2))

A Note on S2

  • Since version 1.0.0, sf supports spherical geometry operations via an interface to Google’s S2 spherical geometry engine
  • S2 is an example of a Discrete Global Grid System (DGGS)
  • sf can run with S2 on or off; by default, the S2 geometry engine is turned on:
sf::sf_use_s2()
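S2 can be toggled; turning it off falls back to planar (GEOS) computations, which reproduces sf’s pre-1.0 behavior:

```r
library(sf)

sf_use_s2()       # TRUE by default: geometry operations run on the sphere
sf_use_s2(FALSE)  # fall back to planar GEOS operations
sf_use_s2(TRUE)   # turn the S2 engine back on
```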

Geographic Data I/O (Input/Output)

There are several ways to read spatial data into R

  1. Load spatial data from our machine or a remote source
  2. Load spatial data as part of an R package
  3. Load data using an API (which often makes use of an R package)
  4. Convert flat files with x, y coordinates to spatial data
  5. Geocode data “by hand” (we saw this earlier)

Vector Data I/O

sf can read numerous file types:

  1. shapefiles
  2. geodatabases
  3. geopackages
  4. geojson
  5. spatial database files

Read in a Shapefile

filepath <- system.file("extdata/city_limits.shp", package = "Rspatialworkshop")
citylims <- read_sf(filepath)
plot(st_geometry(citylims), axes = TRUE, main = "Oregon City Limits")

Read in a Shapefile

Figure 17: Oregon City Limits

Your Turn

Run ?read_sf() and compare read_sf() to st_read(). Our preference is to use read_sf() – why do you think that is?


Our Turn

read_sf() calls st_read() with different defaults:

  • stringsAsFactors = FALSE
  • quiet = TRUE
  • as_tibble = TRUE

Read in a Geodatabase

filepath <- system.file("extdata/StateParkBoundaries.gdb", package = "Rspatialworkshop")
# List all feature classes in a file geodatabase
st_layers(filepath)
# Read the feature class
parks <- st_read(dsn = filepath, layer = "StateParkBoundaries")
ggplot(parks) + 
  geom_sf()

Read in a Geodatabase

Figure 18: State Park Boundaries

Read in a Geopackage

filepath <- system.file("gpkg/nc.gpkg", package="sf")
nc <- read_sf(filepath)
ggplot(nc) + 
  geom_sf()

Read in a Geopackage

Figure 19: North Carolina Counties

Geopackage Advantages

Why use geopackages over shapefiles?

  1. Geopackages avoid the multi-file format of shapefiles
  2. Geopackages avoid the 2 GB size limit of shapefiles
  3. Geopackages are open-source and follow OGC standards
  4. Geopackages are lighter in file size than shapefiles
  5. Geopackages avoid the 10-character limit on column names in shapefile attribute tables (stored in the archaic .dbf format)
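Writing a geopackage from sf is a single st_write() call; a sketch that round-trips the nc dataset shipped with sf (the temporary output path is illustrative):

```r
library(sf)

nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))

# the GPKG driver is inferred from the .gpkg extension
out <- file.path(tempdir(), "nc_copy.gpkg")
st_write(nc, out, layer = "nc", quiet = TRUE)

# a single geopackage file can hold multiple layers
st_layers(out)
```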

Other Approaches

The workbook shows how to read in data using

  1. Open spatial data sources
  2. R packages
  3. OpenStreetMap

Raster Data I/O

  • Read in data using rast()
filepath <- system.file("ex/elev.tif", package="terra")
elev <- rast(filepath)
barplot(elev, digits = -1, las = 2, ylab = "Frequency")

Raster Data I/O

Figure 20: Elevation Data Barplot

Raster Data I/O

plot(elev)

Figure 21: Elevation Data

Raster Data I/O

library(stars)
library(dplyr) # provides the slice() generic used below
filepath <- system.file("tif/L7_ETMs.tif", package = "stars")
L7 <- read_stars(filepath) |>
  slice(index = 1, along = "band")
plot(L7)

Raster Data I/O

Figure 22: Landsat 7

Convert Flat Files to Spatial

It is common for a flat file (e.g., a .csv) to have columns that indicate coordinates – how do we turn this into a spatial vector object?

filepath <- system.file("extdata/Gages_flowdata.csv", package = "Rspatialworkshop")
gages <- read.csv(filepath)
gages_sf <- gages |>
  st_as_sf(coords = c("LON_SITE", "LAT_SITE"), crs = 4269, remove = FALSE)
ggplot(data = gages_sf) + 
  geom_sf()

Convert Flat Files to Spatial

Figure 23: Stream Gages From a Flat File